Data normalization

Initialization

We start the analysis by initializing the packages required for all the analysis performed in this section. We also define the root directory, within which all the input/output operations for this project will be performed. At the end of this document, a detailed software version information is provided for easier reproducibility of the analysis.


Normalization methodology

We use two different normalization strategy to answer specific questions -

  • For a given plate \(i\), the Cytoplasmic and Mitochondrial roGFP2 ratios are normalized by the median of all controls in the same plate i.e for cytoplasm - \[NormCyto(i) = \frac{Cyto_i} {median(Ctrl_i)}\] and for mitochondria, \[NormMito(i) = \frac{Mito_i} {median(Ctrl_i)}\]

This type of control based plate normalization will be suitable in comparing the redox levels in across organells and nutrients since the roGFP2 ratios has been commonly normalized to the plate control.

  • For a given plate \(i\) with the Cytoplasmic and Mitochondrial roGFP2 ratios \(j\) are normalized by their respective plate specific median roGFP2 ratios i.e for cytoplasm - \[NormCyto_{i} = \frac{Cyto_{ij} - median(Cyto_{i})} {mad(Cyto_{i})}\] and for mitochondria, \[NormMito_{i} = \frac{Mito_{ij} - median(Mito_{i})} {mad(Mito_{i})}\]

This type of organelle specific normalization will be suitable in comparing the redox levels in each organelle across all plates. Since this type of normalization tells us how far are the values from the plate median.

Furthermore, for each normalization strategy, we also summarized the quadruplicated values per gene by taking the median of the scaled values. Due to our outlier removal strategy in the previous section, many NA values were introduced. Each mutant has 4 values, we removed mutants with 2 or more NA values and those with only one NA value was substituted with the median of the other 3 observed values. Next, there were also multiple copies of the same mutants (genes) in either the same plate or different plates. We summarized them by taking the set of quadruplicated or median summarized values which have the maximum absolute median value.

Eventually, we generate the following -

  • A table with all the replicate values per organelle and nutrient conditions and the the median summarized value from all plates
  • A matrix with gene/mutants in rows and all combinations of organelle, nutrient and replicates (and median summarized value) in the column

Below is the table of cleaned raw data that we will use in our analysis.

rawDatCleaned = readRDS(paste0(path, "data/workspaces/YeastMutantRedox_RawDataCleaned.RDS"))

normalizeScreenData = function(normMethod) {
    normDat = split(rawDatCleaned, rawDatCleaned$Type)
    normDat = lapply(normDat, function(x) {
        split(x, x$Plate)
    })
    
    normDatReps = normDat
    
    for (i in 1:length(normDat)) {
        for (j in 1:length(normDat[[i]])) {
            tmp = droplevels(normDat[[i]][[j]])
            
            #---------------------------------------------------------------------------------------------
            # Separate Controls, Cytoplasam and Mitochondrial roGFP2 ratios per plate
            #---------------------------------------------------------------------------------------------
            
            # Control
            tmp.ctrl = droplevels(tmp[tmp$Content == "Control", ])
            
            # Cytoplasm
            tmp.cyto = droplevels(tmp[tmp$Content == "Cytoplasm", ])
            
            # Mitochondria
            tmp.mito = droplevels(tmp[tmp$Content == "Mitochondria", ])
            
            rm(tmp)
            
            #---------------------------------------------------------------------------------------------
            # Compute the corresponding normalization factors (median & median absolute
            # deviation)
            #---------------------------------------------------------------------------------------------
            
            # Normalizing factor - Control
            nf.ctrl.med = median(tmp.ctrl$roGFP2.ratio, na.rm = T)
            
            # Normalizing factor - Cytoplasm
            nf.cyto.med = median(tmp.cyto$roGFP2.ratio, na.rm = T)
            nf.cyto.mad = mad(tmp.cyto$roGFP2.ratio, na.rm = T)
            
            # Normalizing factor - Mitochondria
            nf.mito.med = median(tmp.mito$roGFP2.ratio, na.rm = T)
            nf.mito.mad = mad(tmp.mito$roGFP2.ratio, na.rm = T)
            
            #---------------------------------------------------------------------------------------------
            # The two normalization strategies (plate control and median based)
            #---------------------------------------------------------------------------------------------
            
            # Normalization 1 - Plate control based
            if (normMethod == "fracControl") {
                tmp.cyto$roGFP2.ratio = tmp.cyto$roGFP2.ratio/nf.ctrl.med
                tmp.mito$roGFP2.ratio = tmp.mito$roGFP2.ratio/nf.ctrl.med
            }
            
            # Normalization 2 - Plate median based
            if (normMethod == "robustZ") {
                tmp.cyto$roGFP2.ratio = (tmp.cyto$roGFP2.ratio - nf.cyto.med)/nf.cyto.mad
                tmp.mito$roGFP2.ratio = (tmp.mito$roGFP2.ratio - nf.mito.med)/nf.mito.mad
            }
            
            rm(nf.ctrl.med, nf.cyto.med, nf.cyto.mad, nf.mito.med, nf.mito.mad)
            
            #---------------------------------------------------------------------------------------------
            # Normalized data with replicates - coverting from long to wide table format
            #---------------------------------------------------------------------------------------------
            
            #---Cytoplasm---#
            
            tmp.cyto.repl = tmp.cyto %>% select(Gene.Symbol, Plate, Group, roGFP2.ratio, 
                Type, Content) %>% group_by(Gene.Symbol, Group) %>% mutate(pseurep = paste0("roGFP2_ratio_", 
                1:n())) %>% spread(key = pseurep, value = roGFP2.ratio) %>% ungroup() %>% 
                select(-Group) %>% rename(Genes = Gene.Symbol, Nutrient = Type, Organelle = Content)
            
            #---Mitochondria---#
            
            tmp.mito.repl = tmp.mito %>% select(Gene.Symbol, Plate, Group, roGFP2.ratio, 
                Type, Content) %>% group_by(Gene.Symbol, Group) %>% mutate(pseurep = paste0("roGFP2_ratio_", 
                1:n())) %>% spread(key = pseurep, value = roGFP2.ratio) %>% ungroup() %>% 
                select(-Group) %>% rename(Genes = Gene.Symbol, Nutrient = Type, Organelle = Content)
            
            #--Compilation---#
            
            normDatReps[[i]][[j]] = rbind(tmp.cyto.repl, tmp.mito.repl)
            
            # Deleting
            rm(tmp.ctrl, tmp.cyto, tmp.mito, tmp.cyto.repl, tmp.mito.repl)
        }
        rm(j)
    }
    rm(i)
    
    res = lapply(normDatReps, function(x) {
        x = do.call("rbind", x)
        x = as.data.frame(x)
        x$Plate = factor(x$Plate)
        x$Organelle = factor(x$Organelle, levels = c("Mitochondria", "Cytoplasm"))
        x$Nutrient = factor(x$Nutrient, levels = c("Glucose", "Galactose", "Glycerol"))
        rownames(x) = 1:nrow(x)
        return(x)
    })
    
    res = do.call("rbind", res)
    
    #-------------------------------------------------------------------------------------------------
    # Summarizing mutiple mutants (i.e same gene) from the same or different plates
    # Also dropping genes with 2 or more NA values For genes with just 1 NA value,
    # replacing it with the median of the remaining 3 observed values
    #-------------------------------------------------------------------------------------------------
    
    res = res %>% rowwise() %>% mutate(Median_roGFP2_ratio = median(c(roGFP2_ratio_1, 
        roGFP2_ratio_2, roGFP2_ratio_3, roGFP2_ratio_4), na.rm = T), NA_per_row = sum(is.na(c(roGFP2_ratio_1, 
        roGFP2_ratio_2, roGFP2_ratio_3, roGFP2_ratio_4)))) %>% filter(NA_per_row < 
        2) %>% ungroup() %>% mutate_at(vars(starts_with("roGFO2_ratio_")), function(x) ifelse(is.na(x), 
        .$Median_roGFP2_ratio, x)) %>% select(-NA_per_row) %>% group_by(Organelle, 
        Nutrient, Genes) %>% top_n(n = 1, abs(Median_roGFP2_ratio)) %>% ungroup()
    
    #-------------------------------------------------------------------------------------------------
    # Compiling all replicates across organelles and nutrient conditions into a
    # single matrix
    #-------------------------------------------------------------------------------------------------
    
    redoxMat = data.table(res[, c("Genes", "Nutrient", "Organelle", "roGFP2_ratio_1", 
        "roGFP2_ratio_2", "roGFP2_ratio_3", "roGFP2_ratio_4")])
    redoxMat = dcast(redoxMat, Genes ~ Organelle + Nutrient, fun.aggregate = function(x) {
        x
    }, fill = NA, value.var = c("roGFP2_ratio_1", "roGFP2_ratio_2", "roGFP2_ratio_3", 
        "roGFP2_ratio_4"))
    redoxMat = redoxMat[, c(1, 2, 8, 14, 20, 3, 9, 15, 21, 4, 10, 16, 22, 5, 11, 
        17, 23, 6, 12, 18, 24, 7, 13, 19, 25)]
    redoxMat = data.frame(redoxMat, stringsAsFactors = F)
    rownames(redoxMat) = redoxMat$Genes
    redoxMat = redoxMat[, -1]
    
    #-------------------------------------------------------------------------------------------------
    # Compiling the median values across organelles and nutrient conditions into a
    # single matrix
    #-------------------------------------------------------------------------------------------------
    
    redoxMat_median = data.table(res[, c("Genes", "Nutrient", "Organelle", "Median_roGFP2_ratio")])
    redoxMat_median = dcast(redoxMat_median, Genes ~ Organelle + Nutrient, fun.aggregate = function(x) {
        x
    }, fill = NA, value.var = "Median_roGFP2_ratio")
    
    redoxMat_median = data.frame(redoxMat_median, stringsAsFactors = F)
    rownames(redoxMat_median) = redoxMat_median$Genes
    redoxMat_median = redoxMat_median[, -1]
    
    #-------------------------------------------------------------------------------------------------
    # Putting all the results into one
    #-------------------------------------------------------------------------------------------------
    
    res = list(redox_table = res, redox_replicates = redoxMat, redox_median = redoxMat_median)
    rm(normDat, normDatReps, redoxMat, redoxMat_median)
    return(res)
}

normDat.ctrNorm = normalizeScreenData(normMethod = "fracControl")
normDat.Znorm = normalizeScreenData(normMethod = "robustZ")

Normalized data overview

We normalize the data as described above, below is the summary and the distribution of the normalized data -

  • For per plate normalization based on plate specific control
    Genes               Plate            Nutrient           Organelle    
 Length:27694       102    :  564   Glucose  :9192   Mitochondria:13807  
 Class :character   105    :  564   Galactose:9246   Cytoplasm   :13887  
 Mode  :character   110    :  563   Glycerol :9256                       
                    116    :  561                                        
                    113    :  559                                        
                    114    :  559                                        
                    (Other):24324                                        
 roGFP2_ratio_1     roGFP2_ratio_2     roGFP2_ratio_3     roGFP2_ratio_4   
 Min.   :0.000217   Min.   :0.000217   Min.   :0.000217   Min.   :0.00022  
 1st Qu.:1.251178   1st Qu.:1.245491   1st Qu.:1.237680   1st Qu.:1.24057  
 Median :1.569811   Median :1.565550   Median :1.558293   Median :1.55712  
 Mean   :1.628201   Mean   :1.624441   Mean   :1.622153   Mean   :1.61890  
 3rd Qu.:1.939988   3rd Qu.:1.936884   3rd Qu.:1.932582   3rd Qu.:1.92511  
 Max.   :7.431818   Max.   :7.863636   Max.   :7.136364   Max.   :6.84091  
                                                          NA's   :257      
 Median_roGFP2_ratio
 Min.   :0.000217   
 1st Qu.:1.254985   
 Median :1.559173   
 Mean   :1.623041   
 3rd Qu.:1.929933   
 Max.   :7.136364   
                    
  • For per plate normalization based on median organelle specific values
    Genes               Plate            Nutrient           Organelle    
 Length:27695       102    :  564   Glucose  :9192   Mitochondria:13807  
 Class :character   105    :  564   Galactose:9246   Cytoplasm   :13888  
 Mode  :character   110    :  563   Glycerol :9257                       
                    116    :  561                                        
                    113    :  559                                        
                    114    :  559                                        
                    (Other):24325                                        
 roGFP2_ratio_1      roGFP2_ratio_2       roGFP2_ratio_3     
 Min.   :-14.18291   Min.   :-14.180302   Min.   :-17.29949  
 1st Qu.: -0.64326   1st Qu.: -0.674491   1st Qu.: -0.69362  
 Median :  0.03188   Median :  0.002007   Median : -0.02326  
 Mean   :  0.05993   Mean   :  0.035221   Mean   :  0.01129  
 3rd Qu.:  0.69614   3rd Qu.:  0.681601   3rd Qu.:  0.66520  
 Max.   : 27.78831   Max.   : 27.774188   Max.   : 28.68528  
                                                             
 roGFP2_ratio_4     Median_roGFP2_ratio
 Min.   :-9.92714   Min.   :-9.61836   
 1st Qu.:-0.67574   1st Qu.:-0.54478   
 Median :-0.01056   Median :-0.01878   
 Mean   : 0.01211   Mean   : 0.03033   
 3rd Qu.: 0.65572   3rd Qu.: 0.55489   
 Max.   :27.81965   Max.   :27.78125   
 NA's   :255                           
  • Number of mutants per condition
   GalactoseCytoplasm GalactoseMitochondria      GlucoseCytoplasm 
                 4638                  4608                  4622 
  GlucoseMitochondria     GlycerolCytoplasm  GlycerolMitochondria 
                 4570                  4627                  4629 

Sample similarity - Dimensionality reduction

Next, we apply the dimensionality reduction method Multidimesional scaling (MDS) on the roGFP2 ratio values normalized by the plate control and Robust Z normalized to identify the major grouping of the yeast mutants based on their redox status across organelle and nutrient conditions. We distinctly see that the redox status is different between the organelles (also seen by density plots, higher in mitochondria compared to cytoplasm) and within the organelles the mutants group by the nutrient conditions.

getMDSdata = function(dat) {
    mds = cmdscale(dist(t(dat)), eig = TRUE, k = 2)$points
    colnames(mds) = c("MDS1", "MDS2")
    
    col_anno = do.call("rbind", strsplit(rownames(mds), "_"))
    col_anno = col_anno[, -c(1:3)]
    colnames(col_anno) = c("Compartment", "Nutrient")
    mds = data.frame(mds, col_anno)
}

mdsZ = getMDSdata(dat = normDat.Znorm$redox_replicates)
mdsC = getMDSdata(dat = normDat.ctrNorm$redox_replicates)

p1 = ggplot(mdsZ, aes(x = MDS1, y = MDS2)) + theme_classic(base_size = 8) + labs(subtitle = "Robust Z normalized") + 
    geom_point(aes(color = Nutrient, shape = Compartment), size = 2)

p2 = ggplot(mdsC, aes(x = MDS1, y = MDS2)) + theme_classic(base_size = 8) + labs(subtitle = "Plate control normalized") + 
    geom_point(aes(color = Nutrient, shape = Compartment), size = 2)

pA = p1 + p2 + plot_layout(guides = "collect")

p3 <- ggviolin(normDat.Znorm$redox_table, x = "Organelle", y = "Median_roGFP2_ratio", 
    xlab = "", ylab = "Median roGFP2 ratio", draw_quantiles = 0.5, color = "Organelle", 
    facet.by = "Nutrient") + stat_compare_means(label = "p.format", method = "wilcox", 
    cex = 2) + labs(subtitle = "Robust Z normalized") + theme_classic(base_size = 8) + 
    theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank())

p4 <- ggviolin(normDat.ctrNorm$redox_table, x = "Organelle", y = "Median_roGFP2_ratio", 
    xlab = "", ylab = "Median roGFP2 ratio", draw_quantiles = 0.5, color = "Organelle", 
    facet.by = "Nutrient") + stat_compare_means(label = "p.format", method = "wilcox", 
    cex = 2) + labs(subtitle = "Plate control normalized") + theme_classic(base_size = 8) + 
    theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank())

pB = p3 + p4 + plot_layout(guides = "collect")

p5 = ggviolin(normDat.Znorm$redox_table, x = "Nutrient", y = "Median_roGFP2_ratio", 
    xlab = "", ylab = "Median roGFP2 ratio", draw_quantiles = 0.5, color = "Nutrient", 
    facet.by = "Organelle") + stat_compare_means(method = "anova", label.y = 22, 
    label.x = 1.5, cex = 2) + stat_compare_means(label = "p.signif", method = "wilcox", 
    ref.group = ".all.", hide.ns = TRUE) + scale_color_manual(values = c("#00BA38", 
    "#F8766D", "#619CFF")) + labs(subtitle = "Robust Z normalized") + theme_classic(base_size = 8) + 
    theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank())

p6 = ggviolin(normDat.ctrNorm$redox_table, x = "Nutrient", y = "Median_roGFP2_ratio", 
    xlab = "", ylab = "Median roGFP2 ratio", draw_quantiles = 0.5, color = "Nutrient", 
    facet.by = "Organelle") + stat_compare_means(method = "anova", label.y = 5, label.x = 1.5, 
    cex = 2) + stat_compare_means(label = "p.signif", method = "wilcox", ref.group = ".all.", 
    hide.ns = TRUE) + scale_color_manual(values = c("#00BA38", "#F8766D", "#619CFF")) + 
    labs(subtitle = "Plate control normalized") + theme_classic(base_size = 8) + 
    theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank())

pC = p5 + p6 + plot_layout(guides = "collect")

p = pA/pB/pC

ggsave(filename = paste0(path, "analysis/normalization/roGFP2_ratio_comparison_nutrient_compartments.pdf"), 
    plot = p, width = 7, height = 7)

p


Replicate similarity - Correlation

Below we show the correlation among replicates.

  • first with plate median (nutrient specific) based Robust Z normalization

  • second with control based normalization


Distribution of normalized data

  • For per plate normalization based on plate specific control and an interactive data table to access the data

  • For per plate normalization based on median organelle specific values and an interactive data table to access the data


Session information

R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-2 pheatmap_1.0.12    patchwork_1.0.0    ggpubr_0.2.4      
 [5] magrittr_1.5       ggrepel_0.8.1      WriteXLS_5.0.0     data.table_1.12.8 
 [9] forcats_0.4.0      stringr_1.4.0      dplyr_0.8.4        purrr_0.3.3       
[13] readr_1.3.1        tidyr_1.0.2        tibble_2.1.3       ggplot2_3.2.1     
[17] tidyverse_1.3.0    DT_0.12            rmdformats_0.3.6   knitr_1.28        

loaded via a namespace (and not attached):
 [1] httr_1.4.1        jsonlite_1.6.1    modelr_0.1.5      shiny_1.4.0      
 [5] assertthat_0.2.1  cellranger_1.1.0  yaml_2.2.1        pillar_1.4.3     
 [9] backports_1.1.5   lattice_0.20-38   glue_1.3.1        digest_0.6.23    
[13] promises_1.1.0    ggsignif_0.6.0    rvest_0.3.5       colorspace_1.4-1 
[17] htmltools_0.4.0   httpuv_1.5.5      plyr_1.8.5        pkgconfig_2.0.3  
[21] broom_0.5.4       haven_2.2.0       bookdown_0.17     xtable_1.8-4     
[25] scales_1.1.0      later_1.0.0       generics_0.0.2    farver_2.0.3     
[29] ellipsis_0.3.0    withr_2.1.2       lazyeval_0.2.2    cli_2.0.1        
[33] mime_0.9          crayon_1.3.4      readxl_1.3.1      evaluate_0.14    
[37] fs_1.3.1          fansi_0.4.1       nlme_3.1-144      xml2_1.2.2       
[41] tools_3.6.2       hms_0.5.3         formatR_1.7       lifecycle_0.1.0  
[45] munsell_0.5.0     reprex_0.3.0      compiler_3.6.2    rlang_0.4.4      
[49] grid_3.6.2        rstudioapi_0.11   htmlwidgets_1.5.1 crosstalk_1.0.0  
[53] labeling_0.3      rmarkdown_2.1     gtable_0.3.0      DBI_1.1.0        
[57] reshape2_1.4.3    R6_2.4.1          lubridate_1.7.4   fastmap_1.0.1    
[61] stringi_1.4.5     Rcpp_1.0.3        vctrs_0.2.2       dbplyr_1.4.2     
[65] tidyselect_1.0.0  xfun_0.12        

Ashwini Kumar Sharma, PhD

2021-05-21